Table of Contents

History
Architecture and training
Datasets
Quality evaluation
Impact and applications
List of notable text-to-image models
Explanatory notes
See also
References

Text-to-image model

An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion 3.5, a large-scale text-to-image model first released in 2022

A text-to-image (T2I or TTI) model is a machine learning model that takes a natural-language prompt as input and produces an image matching that description.

Text-to-image models began to be developed in the mid-2010s, during the beginnings of the AI boom, as a result of advances in deep neural networks. Beginning in 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, Midjourney, and Runway's Gen-4—came to be considered to approach the quality of real photographs and human-drawn art.

Text-to-image models are generally latent diffusion models, which perform the diffusion process in a compressed latent space rather than directly in pixel space. An autoencoder (often a variational autoencoder (VAE)) is used to convert between pixel space and this latent representation. These systems typically use a pretrained language or vision–language model to convert the input prompt into a text embedding, and a diffusion-based generative image model that produces images conditioned on that embedding. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[imagen-verge]
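
As a concrete illustration of this pipeline, the sketch below runs a pretrained latent diffusion model with the Hugging Face diffusers library; the library and the checkpoint identifier are illustrative assumptions, not something specified by this article.

```python
# Minimal sketch of running a latent diffusion text-to-image model.
# Assumes the Hugging Face `diffusers` library and a public Stable
# Diffusion checkpoint; both are illustrative choices, not the only ones.
import torch
from diffusers import StableDiffusionPipeline

# Load the full pipeline: text encoder, U-Net denoiser operating in
# latent space, VAE decoder, and noise scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt is embedded by the text encoder; the diffusion process
# denoises a random latent conditioned on that embedding, and the VAE
# decodes the final latent into pixel space.
image = pipe("an astronaut riding a horse, by Hiroshige").images[0]
image.save("astronaut.png")
```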

History

Before the rise of deep learning in the 2010s, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as from a database of clip art.[agnesezhu-2007]
The inverse task, image captioning, was more tractable, and a number of image captioning deep learning models predated the first text-to-image models.[mansimov-2015]


The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[mansimov-2015] Images generated by alignDRAW were low-resolution (32×32 pixels, attained by resizing) and were considered to be "low in diversity". The model was nonetheless able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.[mansimov-2015][reed-2016]
In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[reed-2016][frolov] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[reed-2016] Later systems include VQGAN-CLIP, XMC-GAN, and GauGAN2.

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[tc-dalle] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[tc-dalle-2] followed by Stable Diffusion, which was publicly released in August 2022. Also in August 2022, text-to-image personalization was introduced: a model can be taught a new concept using a small set of images of an object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely finding a new text term that corresponds to these images.
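
The idea behind textual inversion can be illustrated with a toy sketch: every weight of the pretrained model stays frozen, and only a single new embedding vector, standing in for the new pseudo-word, is optimized. The modules, shapes, and loss below are simplified stand-ins, not the actual diffusion components.

```python
# Toy illustration of textual inversion: learn ONE new token embedding
# against a frozen model. The "model" here is a random stand-in, not a
# real diffusion network; shapes and losses are simplified for clarity.
import torch

torch.manual_seed(0)
emb_dim = 768

# Frozen pretrained pieces (stand-ins): a text-conditioned projection
# and target features extracted from the user's example images.
frozen_proj = torch.nn.Linear(emb_dim, emb_dim)
for p in frozen_proj.parameters():
    p.requires_grad = False
target_features = torch.randn(4, emb_dim)  # stand-in for example images

# The only trainable parameter: the embedding of the new pseudo-word.
new_token_embedding = torch.nn.Parameter(torch.randn(emb_dim) * 0.02)
optimizer = torch.optim.Adam([new_token_embedding], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    # Condition the frozen model on the new embedding and compare its
    # output to the features of the example images (simplified loss).
    out = frozen_proj(new_token_embedding)
    loss = ((out - target_features) ** 2).mean()
    loss.backward()
    optimizer.step()

# The learned vector can now be used in prompts as a new "word",
# while the rest of the model remains untouched.
```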

Following the success of text-to-image models, language-model-powered text-to-video platforms such as Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or text-and-image prompts.

Architecture and training

High-level architecture showing the state of AI art machine learning models, and notable models and applications

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) were widely used in early systems; since 2020, diffusion models have become the dominant approach. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images or latent representations, and to use one or more auxiliary deep learning models to upscale or decode them, filling in finer details.
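
A toy sketch of this cascaded approach follows: a base stage maps a text embedding to a low-resolution image, and a separate super-resolution stage upscales it and adds detail. Both networks are untrained stand-ins chosen only to keep the example runnable.

```python
# Toy sketch of cascaded generation: a base stage emits a low-resolution
# image, then a super-resolution stage upscales and refines it. Both
# networks are untrained stand-ins; real systems train each stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseStage(nn.Module):
    """Maps a text embedding to a 64x64 RGB image (stand-in)."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 3 * 64 * 64)

    def forward(self, text_emb):
        return self.fc(text_emb).view(-1, 3, 64, 64)

class SuperResolutionStage(nn.Module):
    """Upscales 64x64 -> 256x256 and refines with a small conv net."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, low_res):
        upscaled = F.interpolate(low_res, scale_factor=4,
                                 mode="bilinear", align_corners=False)
        return upscaled + self.refine(upscaled)  # residual detail

text_emb = torch.randn(1, 256)       # stand-in for an encoded prompt
low = BaseStage()(text_emb)          # 1 x 3 x 64 x 64
high = SuperResolutionStage()(low)   # 1 x 3 x 256 x 256
print(high.shape)
```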

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen) as the text encoder, a departure from the then-standard approach.[imagen-paper]
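
A minimal sketch of this frozen-encoder idea follows, assuming the Hugging Face transformers library and a small T5 checkpoint as stand-ins (Imagen itself used a much larger T5 encoder):

```python
# Minimal sketch: use a frozen, text-only pretrained language model as
# the prompt encoder. `t5-small` is an illustrative stand-in; Imagen
# used a much larger T5 variant.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# Freeze the language model: its weights receive no gradients while
# the image-generation model is trained on top of its embeddings.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

tokens = tokenizer("an astronaut riding a horse", return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state

# `text_embeddings` (batch x tokens x hidden) would condition the
# diffusion model; only the image model's weights are updated.
print(text_embeddings.shape)
```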

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image generated by human annotators. Originally, the main focus of COCO was on the recognition of objects and scenes in images. Oxford 102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter.[frolov]
One of the largest open datasets for training text-to-image models is LAION-5B, containing more than 5 billion image-text pairs. This dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs. Because of this, however, it also contains controversial content, which has led to discussions about the ethics of its use.
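
To make the (image, caption) pairing concrete, the sketch below mimics the COCO caption-annotation layout, in which caption records are keyed by image ID and each image carries several human-written captions; all records and paths here are invented for illustration.

```python
# Toy illustration of COCO-style caption annotations: each image has
# several human-written captions, and training pairs are formed by
# pairing the image with each caption. Records here are invented.
from collections import defaultdict

annotations = [  # COCO stores captions as records keyed by image_id
    {"image_id": 1, "caption": "a red school bus on a city street"},
    {"image_id": 1, "caption": "a bus stopped at a crosswalk"},
    {"image_id": 2, "caption": "an all black bird with a thick bill"},
]
images = {1: "images/000001.jpg", 2: "images/000002.jpg"}  # invented paths

captions_by_image = defaultdict(list)
for record in annotations:
    captions_by_image[record["image_id"]].append(record["caption"])

# Flatten into (image_path, caption) training pairs.
pairs = [(images[i], c)
         for i, caps in captions_by_image.items() for c in caps]
for path, caption in pairs:
    print(path, "->", caption)
```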

Some modern AI platforms not only generate images from text but also create synthetic datasets that can be used to train or fine-tune text-to-image models. Such datasets can help avoid copyright issues and expand the diversity of training data.

Quality evaluation

Evaluating and comparing the quality of text-to-image models requires assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.[frolov]
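
One common automated scheme for the alignment desideratum (not named in this article, but widely used in the literature) is to embed the caption and the generated image with a joint vision-language model such as CLIP and take their cosine similarity; the sketch below assumes the Hugging Face transformers CLIP implementation.

```python
# Illustrative automated text-image alignment score: cosine similarity
# between CLIP embeddings of a caption and a generated image. CLIP is
# one common choice; other metrics exist.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in for a generated image
inputs = processor(text=["an astronaut riding a horse"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

alignment = (img_emb * txt_emb).sum().item()  # cosine similarity
print(f"CLIP alignment score: {alignment:.3f}")
```
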
A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inceptionv3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the classifier predicts a single label with high probability for each individual image and the predicted labels vary across the sample, a scheme intended to favour generated images that are both "distinct" and diverse. Another popular metric is the related Fréchet inception distance, which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.[frolov]
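
Both metrics can be computed from classifier outputs alone. The sketch below implements IS from predicted label distributions and FID from extracted features, with random arrays standing in for real Inceptionv3 outputs.

```python
# Sketch implementations of Inception Score (IS) and Frechet Inception
# Distance (FID) from precomputed classifier outputs. Random arrays
# stand in for real Inceptionv3 predictions/features.
import numpy as np
from scipy.linalg import sqrtm

def inception_score(probs, eps=1e-12):
    """IS = exp(mean_x KL(p(y|x) || p(y))) over predicted label dists."""
    marginal = probs.mean(axis=0)  # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def frechet_inception_distance(real_feats, gen_feats):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g).real  # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000), size=500)  # stand-in p(y|x) rows
real = rng.normal(size=(500, 64))               # stand-in real features
gen = rng.normal(loc=0.1, size=(500, 64))       # stand-in generated feats
print("IS:", inception_score(probs))
print("FID:", frechet_inception_distance(real, gen))
```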

Impact and applications

See Artificial intelligence art.

List of notable text-to-image models

Model              | Release       | Developer          | License
DALL-E             |               | OpenAI             | Proprietary
DALL-E 2           |               | OpenAI             | Proprietary
DALL-E 3           |               | OpenAI             | Proprietary
GPT Image 1        |               | OpenAI             | Proprietary
Ideogram 0.1       |               | Ideogram           |
Ideogram 2.0       |               | Ideogram           |
Ideogram 3.0       |               | Ideogram           |
Imagen             |               | Google             |
Imagen 2           |               | Google             |
Imagen 3           |               | Google             |
Imagen 4           |               | Google             |
Firefly            |               | Adobe Inc.         |
Midjourney         |               | Midjourney, Inc.   |
Halfmoon           |               | Reve AI, Inc.      |
Stable Diffusion   |               | Stability AI       | Stability AI Community License
Flux               |               | Black Forest Labs  | Apache License
Aurora             |               | xAI                | Proprietary
Runway Gen-2       | June 2023     | Runway AI, Inc.    |
Runway Gen-3 Alpha | June 2024     | Runway AI, Inc.    |
Runway Frames      | November 2024 | Runway AI, Inc.    |
Runway Gen-4       | March 2025    | Runway AI, Inc.    |
Recraft            | May 2023      | Recraft, Inc.      |
AuraFlow           |               | FAL                | Apache License
HiDream            |               | HiDream-AI         | MIT license

Explanatory notes


See also


References